Main research aim
In our research, we aim to explore various forecasting models and methodologies to improve the prediction accuracy of time series data, specifically focusing on seasonal patterns and the integration of external regressors. We have started by delving into Dynamic Harmonic Regression (DHR) and have particularly consulted the following articles:
Rausch, T. M., Albrecht, T., & Baier, D. (2022). Beyond the beaten paths of forecasting call center arrivals: On the use of dynamic harmonic regression with predictor variables. Journal of Business Economics, 92, 675--706.
Ta-Hsin Li (2003). A Hierarchical Framework for Modeling and Forecasting Web Server Workload
In addition, we intend to further investigate tree-based methods by referencing:
Vercauteren, T., Aggarwal, P., Wang, X., & Li, T.-H. (2007). Hierarchical Forecasting of Web Server Workload Using Sequential Monte Carlo Training. IEEE Transactions on Signal Processing, 55, 1162-1184.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C.B. (1984). Classification and Regression Trees. Wadsworth, Belmont.
We plan to utilize classification and regression trees (CART) as a dimensionality reduction technique and to understand how these methods can be applied to improve our forecasting models.
suppressPackageStartupMessages({
library(dplyr)
library(lubridate)
library(tsibble)
# dhReg ----------------------------------
require(dhReg)
library(dhReg)
})
# Multivariate time series -----------------------
suppressMessages({
ts <- data_2022 %>%
group_by(`Data di Presentazione`,
`Tipo di segnalazione - Descrizione`,
`Descrizione Area Tematica`) %>%
summarise(Numero_Segnalazioni = n())
})
# TS --------------------------------------
# Univariate time series of the cumulative numner of complaints for day
ts_univariate <- data_2022 %>%
group_by(`Data di Presentazione`) %>%
summarise(Numero_Segnalazioni = n()) %>%
ungroup()
# First DHR models ---------------------------------
# Fitting the total number of complaints with a simple DHR model
res1<-dhr(ts_univariate$Numero_Segnalazioni,
XREG = NULL, # no indipendent variable
Range = list(1:2, 1), # Range of terms for Fourier's model
Frequency = c(24, 168), # Seasonality (dayly and weekly)
Criteria = "aicc", # Model selection criteria
maxp = 5, # Max order of AR in auto.arima model
maxq = 5, # Max order of MA in auto.arima model
maxd = 5) # Max differentiation (default setting)
res1
## Series: Data
## Regression with ARIMA(2,1,5) errors
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 ma4 ma5 S1-24
## 0.8625 -0.7088 -1.5797 1.1844 -0.2037 -0.5564 0.3084 -0.0659
## s.e. 0.0725 0.0625 0.0817 0.1469 0.1623 0.1318 0.0624 0.0267
## C1-24 S2-24 C2-24 S1-168 C1-168
## 0.0588 0.0644 -0.0093 0.038 0.3045
## s.e. 0.0267 0.0294 0.0295 0.125 0.1277
##
## sigma^2 = 0.1234: log likelihood = -130.38
## AIC=288.76 AICc=289.96 BIC=343.32
# Forecast DHRregression-------------------------------------
?fc
pred1<-fc(Frequency = c(24, 168) , XREG_test = NULL, h=10,
Fit=res1, Data=ts_univariate$Numero_Segnalazioni)
plot(pred1)
ARIMA(2,1,5) specifies the autoregressive (AR) order as 2, the differencing (I) order as 1, and the moving average (MA) order as 5.
Box Cox transformation: lambda= 0: The Box Cox transformation is applied to stabilize the variance of the time series. Here, lambda=0 suggests a log transformation.
Coefficients:
- ar1, ar2, ma1, ma2, ma3, ma4, ma5: These are coefficients for the autoregressive (AR) and moving average (MA) terms of the ARIMA model. They capture the temporal dependencies in the time series data.
- S1-24, C1-24, S2-24, C2-24, S1-168, C1-168: These are coefficients for the Fourier terms, representing hourly (24) and weekly (168) seasonal patterns.
Model Statistics: - sigma^2 = 0.1234: This is the estimated variance of the error term in the model. - log likelihood = -130.39: Higher log-likelihood values indicate better fit. - AIC (Akaike Information Criterion) = 288.78: Lower AIC values indicate better models. - AICc (corrected AIC) = 289.98: AICc adjusts AIC for small sample sizes. - BIC (Bayesian Information Criterion) = 343.34: Similar to AIC, BIC also balances goodness of fit and model complexity, but it penalizes more complex models more heavily. Lower BIC values indicate better models.
Overall, this model suggests that the time series data has significant autoregressive and moving average dependencies, as well as hourly and weekly seasonal patterns. The statistics indicate that the model provides a reasonable fit to the data, with seasonality being a predominant component. Indeed, all 5 out of 5 seasonal components were needed to minimize the fit. For this reason, trying a SARIMA model will also be planned.
Before moving to different models, we wanted to use the information by passing the other variables in the dataset as regressors through the variable xreg. This step requires further pre-processing of the data. We follow a variation of One Hot Encoding, not with a 0-1 matrix but with the cumulative sums of the complaints with respect to the corresponding column.
# One Hot Encoding -------------------------------
suppressPackageStartupMessages({
library(tidyr)
})
# Expand the matrix
# 1. Aggregate the data to count the reports for each combination of date, report type, and thematic area
aggregated_data <- data_2022 %>%
group_by(`Data di Presentazione`, `Tipo di segnalazione - Descrizione`, `Descrizione Area Tematica`) %>%
summarise(Numero_Segnalazioni = n(), .groups = 'drop')
# 2. Use pivot_wider to transform the data into a wide format
wide_data <- aggregated_data %>%
pivot_wider(names_from = c(`Tipo di segnalazione - Descrizione`, `Descrizione Area Tematica`),
values_from = Numero_Segnalazioni,
values_fill = list(Numero_Segnalazioni = 0))
wide_data_NODATE <- wide_data[, 2:43] %>%
mutate(across(everything(), as.numeric))
res2 <- dhr(ts_univariate$Numero_Segnalazioni,
XREG = as.matrix(wide_data_NODATE),
Range = list(1:2, 1), # Range of the Fourier terms
Frequency = c(24, 168), # Seasonal frequencies (hourly and weekly)
Criteria = "aicc", # Model selection criterion
maxp = 5, # Maximum AR order in auto.arima
maxq = 5, # Maximum MA order in auto.arima
maxd = 5) # Maximum differencing order in auto.arima
print(res2)
## Series: Data
## Regression with ARIMA(2,0,0) errors
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ar1 ar2 intercept S1-24 C1-24 S1-168 C1-168
## 0.4653 -0.0394 4.4409 -0.0196 0.0041 -0.0060 -0.0029
## s.e. 0.0635 0.0617 0.0340 0.0178 0.0179 0.0213 0.0205
## AMA LineaVerde_GESTIONE RIFIUTI E PULIZIA URBANA
## 3e-03
## s.e. 1e-04
## Reclamo_OPERE E MANUTENZIONE DELLA CITTA Reclamo_SERVIZI MUNICIPALI
## 0.0122 -0.0006
## s.e. 0.0042 0.0031
## Reclamo_SICUREZZA URBANA Segnalazione_ANAGRAFE E SERVIZI CIVICI
## 0.0065 0.0117
## s.e. 0.0063 0.0069
## Segnalazione_MOBILITA E TRASPORTI
## 0.0021
## s.e. 0.0029
## Segnalazione_OPERE E MANUTENZIONE DELLA CITTA
## 0.0063
## s.e. 0.0007
## Segnalazione_SERVIZI MUNICIPALI Segnalazione_SICUREZZA URBANA
## 0.0047 0.0049
## s.e. 0.0018 0.0007
## Segnalazione_TURISMO Reclamo_GESTIONE RIFIUTI E PULIZIA URBANA
## -0.0168 -0.0057
## s.e. 0.0243 0.0087
## Reclamo_MOBILITA E TRASPORTI Reclamo_TRIBUTI E CONTRAVVENZIONI
## -0.0013 0.0097
## s.e. 0.0038 0.0068
## Segnalazione_AMBIENTE Segnalazione_GESTIONE RIFIUTI E PULIZIA URBANA
## 0.0034 0.0090
## s.e. 0.0008 0.0061
## Segnalazione_INNOVAZIONE E SMART CITY Segnalazione_PATRIMONIO
## -0.0079 -0.0075
## s.e. 0.0092 0.0132
## Segnalazione_TRIBUTI E CONTRAVVENZIONI Reclamo_ANAGRAFE E SERVIZI CIVICI
## 0.0110 0.0061
## s.e. 0.0042 0.0065
## Reclamo_CASA E URBANISTICA Reclamo_INNOVAZIONE E SMART CITY
## 0.0006 -0.0111
## s.e. 0.0077 0.0138
## Reclamo_SCUOLA Reclamo_SOCIALE Segnalazione_CASA E URBANISTICA
## -0.0045 0.0208 8e-04
## s.e. 0.0196 0.0099 5e-03
## Segnalazione_CULTURA Segnalazione_SOCIALE Reclamo_AMBIENTE
## 0.0004 -0.0103 0.0004
## s.e. 0.0057 0.0216 0.0051
## Segnalazione_COMMERCIO E IMPRESA Segnalazione_SPORT Segnalazione_SCUOLA
## 0.0075 -0.0070 0.0097
## s.e. 0.0069 0.0337 0.0017
## Segnalazione_DIRITTI E PARI OPPORTUNITA Reclamo_CULTURA
## 0.0141 0.0264
## s.e. 0.0050 0.0109
## Reclamo_COMMERCIO E IMPRESA Reclamo_DIRITTI E PARI OPPORTUNITA
## 0.0353 0.004
## s.e. 0.0235 0.008
## Reclamo_SPORT Segnalazione_ALTRI SERVIZI Reclamo_ALTRI SERVIZI
## -0.0076 0.0098 0.0200
## s.e. 0.0448 0.0239 0.0564
## Segnalazione_GESTIONE DELL'ENTE Reclamo_GESTIONE DELL'ENTE
## 0.0015 0.0236
## s.e. 0.0259 0.0370
## Reclamo_PATRIMONIO Reclamo_TURISMO Segnalazione_SANITA E SALUTE
## -0.0287 -0.0283 -0.0624
## s.e. 0.0337 0.0835 0.0607
##
## sigma^2 = 0.02276: log likelihood = 198.61
## AIC=-297.21 AICc=-280.97 BIC=-102.22
# Forecast DHRregression2-------------------------------------
pred2<-fc(Frequency = c(24, 168), h=365, XREG_test = as.matrix(wide_data_NODATE),
Fit=res2, Data=ts_univariate$Numero_Segnalazioni)
# Forecast with DHR Regression
plot(pred2)
This second model is coherent but depends on the matrix. It is highly probable that some sections are too sparse to yield better performance. We should consider either merging some sections if it makes logical sense, removing them, or applying feature engineering.
Following the Professor’s suggestion, we tried a very simple ARIMA model, which is essentially a previous version of DHR.